fix(weave): refuse code-bearing custom objects on server-side decode#7004
Conversation
Server-side workers (the evaluate-model worker) reconstruct user-supplied objects. Gate custom-object deserialization on a per-client policy (WeaveClient.allow_unsafe_custom_obj_decode, default True) so workers can refuse to reconstruct Op / load_op-backed custom types at decode time, including dataset rows materialized lazily during evaluation. https://coreweave.atlassian.net/browse/WB-34909 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…il closed Refines the server-side decode guard (WB-34909): - is_safe_to_decode is now type-level: data-only custom types (images, audio, ...) decode via their in-process serializer, so worker evals on multimodal datasets keep working. Op and unknown types are still refused. - Block the packaged load_op fallback when unsafe decode is off, closing the force-fallback path (name a safe type, ship bad bytes so its serializer throws, then reach the attacker's op). - No active client now fails closed (data-only still decodes) with an actionable warning, instead of fail-open. - Wire OP_CUSTOM_WEAVE_TYPE into the sync test (was dead) and document the process-global client dependency the guard relies on. - Fix a KNOWN_TYPES test leak that rebound the name instead of restoring the set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename WeaveClient.allow_unsafe_custom_obj_decode -> _allow_unsafe_custom_obj_decode so the scary-named, server-only policy doesn't surface in user autocomplete. - Tighten the decode-guard comments toward what the code does (correctness) rather than attack framing; fix a stale test-name reference. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| """ | ||
| ref = Ref.parse_uri(ref_uri) | ||
| if not isinstance(ref, ObjectRef): | ||
| if not isinstance(ref, ObjectRef) or isinstance(ref, OpRef): |
There was a problem hiding this comment.
Why do you need the OpRef check? Now the func name is a bit confusing
There was a problem hiding this comment.
the decode guard in custom_objs.py fires on CustomWeaveType payloads so technically if you had an Op inside a dataset row, we would then blindly call the load_op, so we just check to make sure you don't have an op here.
I think this also protects the initial worker call: client.get(Ref.parse_uri(evaluation_ref)). If evaluation_ref is an op ref, client.get loads and runs that op directly...
|
I attempted something similar here (now closed): #6623 |
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "encoded", |
| # Internal: server-side workers (eval, scoring) flip this off so deserialization | ||
| # never reconstructs code-bearing custom objects. See custom_objs.py. | ||
| self._allow_unsafe_custom_obj_decode = True |
There was a problem hiding this comment.
This seems like something I should be able to set from an env var. Customers might also want to disable remote code execution, even if it's supposedly their own code.
There was a problem hiding this comment.
hmm, but this is the weave client, being run in the weave worker 🤔 not sure how we expose it to our backend as an env var but not to users. It would be totally useless if a user set this in their own client, or worse, it would break things that are fine for users on their own compute.
There was a problem hiding this comment.
It would be totally useless if a user set this in their own client, or worse, it would break things that are fine for users on their own compute.
As a user of weave I want the ability to disable local execution of code that's hosted remotely. Are many users relying on deserializing code-bearing objects? I know some users publish and share ops, but I don't have the sense that it's especially common.
Also, if there was an env var then we could set ALLOW_UNSAFE_CUSTOM_OBJ_DECODE=false in the helm charts and disable the behavior globally, rather than doing it in code service-by-service.
- rename _assert_object_ref -> _assert_non_op_ref (OpRef subclasses ObjectRef) - add require_secure_weave_client helper; worker uses it - add ids= to parametrized decode tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…9-review # Conflicts: # weave/trace_server/workers/evaluate_model_worker/evaluate_model_worker.py
jtschoonhoven
left a comment
There was a problem hiding this comment.
A few comments. I still recommend reading _allow_unsafe_custom_obj_decode from an env var, but otherwise this looks like a straightforward improvement.
| # Internal: server-side workers (eval, scoring) flip this off via | ||
| # `require_secure_weave_client` so deserialization never reconstructs | ||
| # code-bearing custom objects. See custom_objs.py. | ||
| self._allow_unsafe_custom_obj_decode = True |
There was a problem hiding this comment.
If I was a user and I read this block I would feel concerned 🤨
Probably worth writing the comment with customers in mind, to explain why this should be safe for them.
Also as a user I might want the same mechanism to disable code-bearing custom objects.
| # An op ref passed as the evaluation/model ref would be loaded and run by | ||
| # `client.get` directly, before the decode guard applies, so reject it here. | ||
| _assert_non_op_ref(args.evaluation_ref, "evaluation_ref") | ||
| _assert_non_op_ref(args.model_ref, "model_ref") |
There was a problem hiding this comment.
I think this comment is wrong and client.get now would raise an UnsafeDeserializationError. So the _assert_non_op_ref isn't necessary. But if I'm wrong that seems like a problem, since client.get should definitely check _allow_unsafe_custom_obj_decode
… op-ref check Adds `WEAVE_ALLOW_UNSAFE_CUSTOM_OBJ_DECODE` (UserSettings) so deployments can disable code-bearing custom-obj decode globally and users can opt into hardening; the decode guard ANDs the setting with the per-client flag. Removes the redundant `_assert_non_op_ref` (op refs already raise UnsafeDeserializationError via the decode guard on client.get) and adds File to the safe/known type sets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression guard for the deleted `_assert_safe_ref` pre-check (WB-34909). An op ref passed directly as evaluation_ref/model_ref must raise UnsafeDeserializationError at `client.get` decode time, not import user code. Currently green; nothing else pins that the decode guard subsumes the dropped recursive payload check, so a future refactor of op loading could silently re-open RCE on the eval worker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Op, or any custom type that hits the packagedload_opfallback, runs user-uploaded code. AddsWeaveClient.allow_unsafe_custom_obj_decode(defaultTrue); the worker flips it off._decode_custom_objstill decodes data-only types (images, audio) via their in-process serializer, but refusesOp/unknown types and theload_opfallback, so multimodal eval datasets keep working while code never loads. No active client fails closed.client.getwould run before decode applies. Tracking: WB-34909Testing
unit tests for
is_safe_to_decode+ the ref guard +KNOWN_TYPESsync; decode tests for data-only decode under the worker policy,Oprefusal, and the forgedload_opfallback; e2e eval rejects anOpdataset row at decode time.